Abstract


BitTorrent suffers from one fundamental problem: the
long-term availability of content. This occurs on a massive
scale, with 38% of torrents becoming unavailable within
the first month. In this paper we explore this problem
by performing two large-scale measurement studies including
46K torrents and 29M users. The studies go significantly
beyond any previous work by combining per-node,
per-torrent and system-wide observations to ascertain the
causes, characteristics and repercussions of file
unavailability. The studies confirm the conclusion of previous
works that seeders have a significant impact on both
performance and availability. However, we also present some
crucial new findings: (i) the presence of seeders is not the
sole factor involved in file availability, (ii) 23.5% of nodes
that operate in seedless torrents can finish their downloads,
and (iii) BitTorrent availability is discontinuous, operating
in cycles of temporary unavailability. Given these new
findings, we consider it important to revisit the solution
space; to this end, we perform large-scale trace-based
simulations to explore the potential of two abstract approaches.


1 Introduction 


BitTorrent has become a de-facto standard for scal- 
able content distribution over the Internet. The reason for its 
success is its ability to efficiently leverage the uplink capac- 
ity of nodes whilst achieving high scalability during peak 
demands [18]. This efficiency is largely attributable to 
BitTorrent’s tit-for-tat mechanism, which encourages users 
to share their resources whilst downloading files. 


Despite the success of BitTorrent, it still suffers from a
[...]. More specifically, content that is distributed using
BitTorrent [...]. For example, [...] found that the available
lifespan [...]


1, Universidad Carlos III de Madrid ?, Lancaster University ° 


[...] The most intuitive reason for this occurrence is that
previously successful [...] (seeders) [...], leaving only users
that possess a subset of the file (leechers). Subsequently,
unavailability occurs [...]. Previous research (such as [8],
[7], [13]) has promoted the importance of seeders in regard to
availability and concluded that a seedless torrent is unable
to reconstruct the file. However, this conclusion is challenged
by the observation that some torrents continue to effectively
serve files despite lacking any seeders.

In this paper, we devote our attention to understanding
and characterizing BitTorrent’s file unavailability problem.
We strive to discover the scale, causes and repercussions
of the problem alongside investigating the possible solution
space. To achieve this we have performed two large-scale
measurement studies: the first investigates BitTorrent on a
macroscopic level by periodically probing over [...] torrents
[...], whilst the second study investigates BitTorrent on a
microscopic level by contacting [...]. To the best of the
authors’ knowledge, this is the largest dataset in terms of
size and collected information used to investigate file
availability in BitTorrent. This allows us to extend previous
works to obtain far more accurate results; through this we
make a number of interesting findings:


• [...] the absence of seeders. However, in 14% of cases,
leechers can reconstruct the file without any seeders
present. We therefore discover that the presence of seeders
is not the sole factor involved in BitTorrent’s unavailability
problem. Such torrents achieve this through the possession
of [...] aggregate download rates that enable leechers to
quickly replicate rare chunks.


• In 64% of torrents, unavailability is not immutable
and, instead, occurs in cyclic periods followed by reoccurring
availability. This is due to old seeders returning to
torrents in which they previously participated.


• The combination of the two previous observations results
in 23.5% of users affected by a lack of seeders [...]


• Users often become frustrated with unavailable torrents
that exhibit poor download rates. We observe a [...]
downloads, thereby exacerbating unavailability, resulting in
further abortions.


These new findings make it crucial to revisit the solution
space to investigate behaviour under the new, accurate
workload defined by our large-scale dataset. As such, we
perform trace-based simulations looking at both traditional
single-torrent and cross-torrent approaches to solving the
file unavailability problem; our primary results are:


• Single-torrent incentive mechanisms must encourage
[...] availability.


• [...] mechanisms [...]


The rest of the paper is structured as follows; Section 
2 provides related work. Section 3 then details the problem 
and our measurement methodology. Following this, Section 
4 characterises the causes and impact of unavailability. We 
discover a primary cause is a lack of seeders and therefore 
Section 5 investigates seedless states in BitTorrent. Next, 
we utilise our measurement study data to explore the poten- 
tial solution space with trace-based simulations in Section 
6. Finally, we conclude the paper in Section 7. 


2 Related Work 


BitTorrent Measurements: BitTorrent measurement
studies can be classified into two different groups. The first
type uses [...] whereas the second [...] to retrieve the
information from the system [...]. The first type of
measurements is less intrusive since they do not actively
interfere with the system. However, they are often
problematic to obtain since they require the agreement of
content providers. The crawling techniques, on the other
hand, can be divided into two categories. In its simplest
form, a crawler exploits the BitTorrent protocol to
periodically [...] participating in the torrent from the
tracker [21]. This makes it possible to study the [...] the
torrents under analysis. This is what we name macroscopic
crawling. More sophisticated crawlers also [...] and [...].
We name this microscopic crawling. Although the microscopic
crawling gives more detailed information, it is noticeably
less scalable and only allows a few thousand torrents to be
studied in parallel [18], [19]. Each approach is effective
for addressing particular needs; however, these have not yet
been combined to investigate BitTorrent in a holistic way.


BitTorrent’s File Availability Analysis: There are only
a few works investigating availability issues in BitTorrent
systems [7], [15], [13]. Neglia et al. mainly study the
tracker/DHT availability of 22,000 torrents obtained from
two torrent indexing sites [15]. Guo et al. extended
this to model the lifespan of torrents by analyzing a limited
number of tracker traces from [15]; it was found that most
[...]. This model starts from the basis that content is
unavailable when there are no seeders present in the swarm.
This is, so far, an unverified hypothesis that is important
to investigate. Similarly, Menasche et al. also use this
hypothesis to investigate the availability of seeders in
45,000 torrents obtained from the Mininova website [13],
finding that 40% of swarms lack seeders for more than 15
days in the first month after the torrent’s birth.


Improving BitTorrent File Availability: Surprisingly,
little research work has been performed into addressing
file availability in BitTorrent [20]. The most recent
work addresses the file availability problem in BitTorrent by
bundling files to extend the online times of the users. Using
a queuing-theoretic model and controlled experiments on
PlanetLab, the authors show that this approach can reduce
waiting time for peers in torrents with highly unavailable
seeders. However, their results assume that peers arrive
in a constant Poisson process, which is a strong assumption
given the measurement results presented in [...] and also
in this paper.

Guo et al. were the first to propose intriguing ideas
and results for cross-torrent collaboration. Amongst other
things, the authors sketch an abstract mechanism for instant
inter-torrent collaboration; following this they also evaluate
the principles. Yang et al. propose a variation of these ideas
by designing a cross-torrent tit-for-tat strategy that assumes
repeated interactions of the users. However, this method
suffers because, as Piatek et al. show through extensive
measurements, [...] will never meet again at any later point
in time [17]. Piatek et al. subsequently propose an
alternative protocol that enables [...] in BitTorrent [...]


3 Problem Background and Methodology 
3.1 Defining File Availability 


To study and understand the availability of files in
BitTorrent we first present a [...]. Let’s assume that
we have a torrent T, formed by N nodes, managing the
download of a [...]. Thus, we can define the vector V_i that
contains the information about the pieces [...] by peer i:
V_ij = 1 if node i has the piece j; V_ij = 0 if node i does
not have piece j. V_i is typically known as the bitfield of
node i.


We define the [...] of torrent T at a time instant t as


U(T) = [...]    (1)


where [...] represents the logical OR-operation
over the piece j across all the nodes in the torrent T.
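
As a minimal sketch of this definition, the Python fragment below computes the availability of a torrent from its nodes' bitfields. It assumes that U(T) is normalised as the fraction of pieces for which the logical OR across all nodes is 1, so that U(T) = 1 exactly when a full copy of the file exists in the swarm. The function name and the list-of-lists bitfield layout are illustrative assumptions, not part of the measurement tooling.

```python
from typing import Sequence


def piece_availability(bitfields: Sequence[Sequence[int]]) -> float:
    """Fraction of pieces held by at least one node (assumed form of U(T))."""
    if not bitfields or not bitfields[0]:
        return 0.0
    num_pieces = len(bitfields[0])
    # Logical OR over all nodes i for each piece j.
    available = sum(
        1 for j in range(num_pieces) if any(bf[j] for bf in bitfields)
    )
    return available / num_pieces


# Three leechers, four pieces: nobody holds the last piece.
swarm = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 0],
]
print(piece_availability(swarm))  # 0.75
```

Under this assumed normalisation, a value below 1 means that at least one piece is missing from every bitfield in the swarm.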


3.2 The Circumstances of Unavailability 


It is important to understand in which circumstances a
file becomes unavailable, based on our definition. A file is
[...] inaccessible within a swarm. This situation arises if
there are no peers in the swarm that possess a given piece or,
alternatively, if the peer(s) that possess the piece are
inaccessible (e.g. due to firewalls, NAT or overlay graph
disconnection). It is intuitive to consider the former as a
far more likely circumstance (e.g. most BitTorrent clients
implement techniques such as NAT traversal [14]; moreover,
they include neighbor discovery techniques such as the Peer
Exchange Protocol (PEX) and periodic tracker polling that
prevent graph disconnection). Therefore, given this
assumption, [...]. Without detailed analysis, we can
therefore currently state that:


• With an accessible seeder, a file is available
• Without an accessible seeder, a file may be available


This paper uses these two observations as a starting point
to investigate unavailability in BitTorrent. In the following
sections, we denote time periods in a torrent’s lifecycle in
which no seeder is online as a seedless state. To this end,
the file is unavailable if torrent T is in seedless state and
U(T) < 1.


3.3 Measurement Methodology 


To study the unavailability problem and specifically the 
seeders’ role in it, we have performed two large-scale mea- 
surement studies using microscopic and macroscopic crawl- 
ing. To the best of our knowledge, this paper is the first to 
combine both microscopic and macroscopic crawling tech- 
niques to better understand BitTorrent (specifically BitTor- 
rent’s file availability). 

Microscopic Crawling: To truly understand unavailability
in BitTorrent, it is necessary to be able to view the
microscopic characteristics of any given swarm, e.g. piece
distribution or nodes’ download rates. Without this, one can
only get a rough estimation of availability using metrics
such as the number of seeders. The information regarding the
behaviour of individual peers provides the necessary data to
make new, more accurate findings. To gain this information
we developed and deployed a distributed [...] 20 nodes in
the Emulab testbed [5].

The crawler operated from July 18, 2009 to July 29,
2009 (micros-1) and then again from August 19, 2009
to September 5, 2009 (micros-2). [...] (PEX). For
the micros-1 study, the crawler followed 255 torrents
appearing on Mininova¹ after the first measurement hour; in
these torrents, we observed 246,750 users. The micros-2
dataset contains information from 577 torrents and 531,089
users.

Macroscopic Crawling: The microscopic measurements
provide detailed insight into the distribution of pieces and
download rates within the swarm, as well as between
different peers. However, due to scalability issues it is
difficult to perform such detailed measurements on a very
large scale (e.g. several thousand torrents). To complement
these results we therefore also implemented [...] website
after December 09, 2008 for a period of 38 days. This
crawler [...] (we were able to systematically collect 98% of
all the IP addresses from within the swarms). This study
allowed us to gain an extremely large number of measurements
regarding details such as [...]. This information can
subsequently be correlated with our smaller-scale microscopic
measurements to derive such things as the scale of seedless
states and the causes for seedless states occurring. Our
final macroscopic dataset consisted of reports from 46,227
torrents and 29,066,139 users.


¹ The largest BitTorrent community based on Alexa Ranking.


4 Characterising Unavailability: Causes and 
Impact 


In this section, we first investigate the role that seeders 
play in file unavailability. Following this we study the ex- 
ceptions and variations we discovered. Lastly, we then in- 
vestigate the real-time impact that a lack of seeders has on 
client performance and their subsequent reactions that can 
be observed. 


4.1 Investigating the Role of Seeders in 
File Unavailability 


It is intuitive to think that U(T) < 1 in a torrent with- 
out any seeder (that is, leechers are unable to reconstruct 
the file). However, this is, so far, an unverified assumption 
that must be investigated (and quantified). To ascertain this, 
we inspect the (i) nodes’ bitfield and (ii) nodes’ download 
rates in all the torrents of our microscopic traces affected by 
seedless states. 


4.1.1 Bitfield Analysis 


We have collected every node’s bitfield for all the torrents
in our microscopic measurements as they have evolved over
time. For each torrent we have computed U(T) periodically
every 10 minutes during any period a torrent is without any
seeders (i.e. it is in a seedless state). This allows us to
ascertain whether a full copy of the file exists in the
torrent at any given time. Fig. 1 shows the CDF of max(U(T))
observed in the seedless state for each torrent that we
studied. From this data, we can extract two pieces of
information; first, in the majority of cases (86%) our
hypothesis is confirmed and the contacted [...]
(i.e. max(U(T)) < 1). Clearly, this means that seeders do
have a significant impact on the availability of files in
BitTorrent. Importantly, however, we also find that a notable
proportion of torrents (14%) actually remain available even
without a seeder. Collectively, this makes up 24% of all
leechers that operate in [...]. This is a crucial finding
that has not been observed before; it is therefore in
contrast with previous models that consider all seedless
torrents to be unavailable.


4.1.2 Download Rate Analysis 


A limitation of the bitfield analysis is that not all nodes are
accessible due to NATs. To address this, we also inspect
the aggregate torrent download rates. Through this, we can
[...] KBps. From this we can derive that the node cannot find
any new pieces to download.


Figure 1. Piece availability in torrents affected by seedless
states (x-axis: piece availability max(U(T)); y-axis:
cumulative fraction of seedless states).

To highlight our findings, we first inspect a representative
torrent from our microscopic trace², shown in Fig. 2.
The figure shows the median instant download rate of the
online leechers over time, sampled every 10 minutes. It
also plots the number of seeders and leechers, as well as
the number of copies of the least replicated piece. Note that
when the number of seeders becomes 0, the torrent enters a
seedless state.

The torrent can be observed to enter a seedless state after
the middle of day 3, remaining in this state for roughly
two days. When the final seed departs the download rate of
[...] coincides with the number of least-[...]. It can
therefore be confidently inferred that the file is, indeed,
unavailable during this period due to the departure of the
last seeder.

Interestingly, it can also be seen that the torrent becomes
available again during day 5. As the seeders return [...]
again. In contrast to past assumptions, it is therefore
evident that [...]. This important phenomenon will be
investigated further in Section 5.3.
² We have observed the same behaviour in most of the torrents
affected by seedless states.


Figure 2. Snapshot from a torrent in our microscopic trace.


The above analysis has inspected a representative torrent.
To validate its widespread applicability we also look
at the download rate degradation in all torrents. To achieve
this, we have taken all the users that have been affected by
a seedless state and separated their downloading time into
two periods: (i) periods in which they have suffered from
a seedless state and (ii) periods in which they have not.
Fig. 3 presents the download rate distribution for both
periods. First, we can observe that the download rate in a
non-seedless state is much higher than in a seedless state.
[...], indicating that the peers cannot locate any required
pieces and the file is, indeed, unavailable. Second, however,
we also observe that [...]. This can be attributed to two
reasons: (i) the aforementioned 14% of torrents [...]. This
can be observed in the representative torrent (cf. Fig. 2):
between days 4 and 5 there is a peak in the number
of leechers which results in a short peak in the download
rate as newcomers download the available pieces.
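
The separation of each user's downloading time into seedless and non-seedless periods can be sketched as follows. The `Sample` record, its field names and the example values are hypothetical; the sketch only assumes that each sample pairs a download rate with the number of seeders online during the same sampling interval.

```python
from statistics import median
from typing import Iterable, List, NamedTuple, Tuple


class Sample(NamedTuple):
    rate_kbps: float  # leecher download rate over one sampling interval
    seeders: int      # seeders online in the torrent during that interval


def split_rates(samples: Iterable[Sample]) -> Tuple[List[float], List[float]]:
    """Return (rates in seedless intervals, rates in non-seedless intervals)."""
    seedless, seeded = [], []
    for s in samples:
        (seedless if s.seeders == 0 else seeded).append(s.rate_kbps)
    return seedless, seeded


# Hypothetical samples for one leecher, one per sampling interval.
samples = [
    Sample(40.0, 3),
    Sample(35.0, 1),
    Sample(0.0, 0),
    Sample(0.5, 0),
    Sample(50.0, 2),
]
seedless, seeded = split_rates(samples)
print(median(seedless), median(seeded))  # 0.25 40.0
```

Pooling the two lists over all affected users gives the two empirical distributions that a plot such as Fig. 3 compares.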


4.2 Investigating the Causes of Swarm 
Resilience 


The previous section has identified a notable percentage
(14%) of torrents that can maintain availability even without
any seeders; this represents 24% of all leechers that
encounter seedless states.


Figure 3. Interval download rates for nodes affected by the
lack of seeders (series: seedless state and non-seedless
state, each for micros-1 and micros-2).


To investigate this, we separate torrents into those that
maintain availability in a seedless state (resilient torrents)
and those that cannot reconstruct the file (susceptible
torrents). We then investigate quantitative properties of
these two groups to ascertain how they differ at various
points in their lifecycles. All of the identified metrics
have been calculated for each group (across all member
torrents) every 10 minutes using information from the
microscopic traces and the tracker reports. These values have
then been averaged together over each time period
investigated.

Table 1 gives an overview of all metrics used in this
analysis. We calculate these over two time periods: the
beginning of the torrents’ lifecycle and just before the last
seeder goes offline. Although not included in the table, we
also investigated the effects of file size and content type
without ascertaining any correlation. Most metrics are
straightforward; however, two require some explanation:
Distribution Entropy (E(T)) and the Churn Factor (CF).

The E(T) [...] the swarm; this is [...]. We therefore
characterize the distribution entropy in a torrent T at a
[...]. Recall that N is the [...] and V_i is the bitfield of
node i. This index is similar to Jain’s Fairness Index [11]
and achieves a value of 1 if all pieces are equally [...].

[...] defines the [...]. This factor is [...] within [...];
by default t = 10 mins.
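
For reference, Jain's Fairness Index of n values x_1, ..., x_n is (sum of x_i)^2 / (n * sum of x_i^2), which equals 1 when all values are equal. The sketch below applies that index to per-piece replica counts derived from the bitfields. This is an assumption made for illustration only: the text states that E(T) is similar to Jain's index, and its exact form may differ. All function names are hypothetical.

```python
from typing import List, Sequence


def replica_counts(bitfields: Sequence[Sequence[int]]) -> List[int]:
    """Number of nodes holding each piece (column sums of the bitfields)."""
    return [sum(column) for column in zip(*bitfields)]


def jain_index(values: Sequence[float]) -> float:
    """Jain's Fairness Index: (sum x)^2 / (n * sum x^2); 0.0 if undefined."""
    squares = sum(v * v for v in values)
    if not values or squares == 0:
        return 0.0
    total = sum(values)
    return (total * total) / (len(values) * squares)


even = [[1, 1, 1], [1, 1, 1]]                          # every piece has 2 copies
skewed = [[1, 1, 1], [1, 0, 0], [1, 0, 0], [1, 0, 0]]  # first piece dominates

print(jain_index(replica_counts(even)))    # 1.0
print(jain_index(replica_counts(skewed)))
```

The index only measures how evenly replicas are spread; it says nothing about whether the rarest piece is still present, which is why it is computed alongside the least-replicated-piece count.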


                                                 Time before seedless state                      Time after torrent’s birth
                                               1 hour                 6 hours                 6 hours                 24 hours
Metric                                   Resilient Susceptible   Resilient Susceptible   Resilient Susceptible   Resilient Susceptible
Swarm speed (in KBps)                        58.83       23.88       68.58       24.99       95.72       53.50       62.99       44.82
Seeder/Leecher ratio                          0.15        0.03        0.14        0.04        0.19        0.19        0.32        0.43
Firewalled/NATed peers (in %)                70.86       61.09       62.30       60.67       51.30       56.05       54.94       58.44
Distribution Entropy E(T)                     0.93        0.94        0.92        0.93        0.91        0.90        0.92        0.91
Least replicated piece (# of copies)          9.21        1.61        8.38        2.82       15.34       14.21       21.33       23.55
Churn factor CF                               0.03        0.21        0.08        0.15        0.07        0.11        0.08        0.07
Online leechers                             281.34      111.61      250.48      101.55      134.05       82.71      163.12       89.55
Online seeders                                5.28        1.51        6.11        1.80       12.05       11.25       18.17       21.34


Table 1. Characteristics of resilient torrents (those that maintain availability in seedless state) and
susceptible torrents (those that cannot reconstruct the file).


From the data in Table 1 we can make the following
important observations:


• Torrent Popularity: From the beginning, resilient
torrents [...].


• [...]; if this coincides with the loss of a seeder then it
becomes impossible to recover these pieces again until a
seeder returns. Resilient torrents have significantly lower
churn [...].


• Seeder/Leecher Ratio: Resilient torrents exhibit a
higher seeder/leecher ratio and, as a derivative of this,
experience download rates that are over twice as high
as susceptible torrents. This superior performance is
highly beneficial for the survival of piece replicas as
[...]. Before a seedless state occurs, resilient torrents
therefore have many more replicas of the rarest piece when
compared to susceptible torrents.


In summary, these results show that [...] download rates due
to beneficial seeder/leecher ratios. The combination of these
factors results in [...] susceptible counterparts. This makes
such swarms highly resilient to the loss of any seeders.
Importantly, it also can be concluded that [...] (e.g. piece
selection) but in [...]. This is exemplified by the lack of
any correlation between resilience and distribution entropy.


4.3 Effects and Trends of Seed Departure


Despite a notable percentage of torrents surviving without
seeders, it is evident that the loss of all seeders can often
result in unavailability. This section now investigates the
effects that this has on both individual users and wider
system performance. Three stages can be identified which we
now discuss.

[...] To extend the earlier analysis, we now compare the
average download rate of users that suffer from a seedless
state at some point during their download against the average
download rate of users that always find content available. We
first categorise users into two groups: affected vs.
non-affected. The first group of users consists of leechers
that are (at some point) affected by a lack of seeders. The
non-affected users, on the other hand, have at least one
seeder available during their entire download. Fig. 4 gives
the download rate distribution of both user groups as
obtained from the two microscopic crawlings. Whereas the
median download rate for the non-affected users is 36 KBps in
micros-1 and 48 KBps in micros-2, the performance for peers
attempting to download unavailable content is only 0.06 KBps
and 3.8 KBps, respectively.


Figure 4. Comparison of download performance between peers
affected and not affected by seedless states (x-axis:
download rate in KBps; y-axis: cumulative fraction of hosts;
series include affected (micros-2) and non-affected
(micros-2)).


The second observable stage is a direct derivative of the
decrease in download performance. [...] To study this
we examine the session times in our microscopic traces. We
observe that 89% of users affected by file unavailability
(i.e. participating in susceptible torrents) abort their
downloads [...]. Sadly, this [...] files available again. In
contrast to these results, users [...] 34.47%. Although this
seems initially high, we also [...] that many users operating
in other torrents that do not suffer from unavailability also
abort their downloads. On closer inspection, these
‘unnecessary’ abortions occur if torrents [...].

The third stage in this process is the worrying emergence
of a chain reaction. We find that [...]. This results in an
exacerbation of the torrent’s unavailability and a further
drop in download rates for those trying to access the
remaining chunks. As other users witness this trend, they
too abort their downloads. This process results in fewer
[...] and therefore greater unavailability and more
abortions. Frequently, the above two repercussions of
unavailability and the creation of this chain reaction [...].

From these findings we derive that users are highly
sensitive [...]


5 Characterising Seedless States 


The previous section has validated and quantified the im- 
portance of seeders in regard to file availability in BitTorrent 
and discussed under which circumstances a file of a seedless 
torrent becomes unavailable. It has been found that in the 
majority of torrents (86%), the loss of all seeders results in 
unavailability. In this section we therefore investigate the 
behaviour of seeders and characterise the nature of seed- 
less states using our large scale dataset. We first look at the 
frequency of seedless states in BitTorrent. Following this 
we investigate the causes of seedless states before, finally, 
investigating the issue of why torrents can become revived 
again after extended periods of unavailability. 


5.1 How Prevalent are Seedless States? 


To quantify how prevalent seedless states are in BitTorrent,
we ask the following question: how many torrents are
affected by seedless states, and to what extent? To
answer this, we use the logs from our macroscopic trace,
which give us a large-scale view of the system comprising
46K torrents.


Figure 5. Illustration of a seedless state (labels: user;
availability period; unavailability period).


The measurements show that more than 38% of torrents
(17,568 out of 46,227) lose their seeders within the first
month, out of which 72% lack seeders after only 5 days.
Similarly, we find that [...]. To exemplify the scale of
this, [...] in our study, more than 9.68 million users (33%
of all users seen) participated in torrents with highly
unavailable seeders, suggesting that this is not only a long
tail [...]. Out of these users, more than 1.59 million were
directly affected by seedless states.


5.2 Why do Seedless States Occur? 


Since seedless states are highly prevalent in real swarms, 
an intuitive question is: why do they occur in the wild? In 
this section, we first identify and then further investigate 
the influencing factors responsible for triggering seedless 
states. 


5.2.1 Identifying Influencing Factors 


There are two main factors that directly influence the
existence of seedless states: (i) the inter-arrival times of
users and (ii) the seeding times of users. To illustrate the
influencing factors, we use a simple example shown in Fig. 5.
In this figure, each horizontal line represents the lifetime
of a user; these users can either be in a leecher state (thin
lines) or a seeding state (thick lines).


It seems [...] (as demonstrated later on [...]).


Let’s now assume that user n is the last available seeder
in our example torrent and none of the previous seeders
return to the torrent. In this case, a seedless state occurs
when the time required for leechers to download the file
exceeds the online time of the last seeder. For example,
Fig. 5 shows that after the last available seeder leaves the
swarm at time t3, none of the remaining leechers were able to
finish the download. If we focus on the n-th node and its
subsequent successor in the torrent (the (n+1)-th), the
inter-arrival time between both users is given by
T_{n+1} (= t2 - t1) whereas the seeding time of node n is
given by u_n. Assume that both users n and n+1 download a
file of size F, with rates D_n and D_{n+1} respectively.
[...]


[...]    (3)


To simplify the analysis, we assume that D_n = D_{n+1}³.
In this case, the seedless state is reached if the
inter-arrival time is greater than the seeding time.

To summarise, seeding times as well as inter-arrival
times play an important role in the generation of seedless
states and subsequently in the long-term availability of
content. Since both parameters are not directly correlated,
we individually analyse both of them in the following.
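
The timing argument of this example can be sketched in code. The general condition used here, T_{n+1} + F/D_{n+1} > F/D_n + u_n, is an assumption derived from the description: node n is taken to start seeding once its own download completes, and node n+1 must finish before node n stops seeding. With equal rates it reduces to the stated case in which the inter-arrival time exceeds the seeding time. Function and parameter names are illustrative.

```python
def enters_seedless_state(
    inter_arrival: float,  # T_{n+1}: arrival gap between node n and node n+1
    seeding_time: float,   # u_n: how long node n seeds after finishing
    file_size: float,      # F
    rate_n: float,         # D_n: download rate of node n
    rate_next: float,      # D_{n+1}: download rate of node n+1
) -> bool:
    """True if node n+1 is still downloading when node n stops seeding.

    All times are measured from the arrival of node n, in consistent units.
    """
    seeder_departure = file_size / rate_n + seeding_time
    next_completion = inter_arrival + file_size / rate_next
    return next_completion > seeder_departure


# Equal rates: only the comparison of T_{n+1} and u_n matters.
print(enters_seedless_state(5.0, 3.0, 700.0, 50.0, 50.0))  # True
print(enters_seedless_state(2.0, 3.0, 700.0, 50.0, 50.0))  # False
```

A slower successor (rate_next below rate_n) can trigger a seedless state even when the inter-arrival time is shorter than the seeding time, which is why the equal-rate case is a simplification.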


5.2.2 Arrival Behaviour of Users 


The first behavioural characteristic that is paramount to 
seedless state generation is the inter-arrival times of users. 
In this regard, intuitive questions are: (i) what inter-arrival 
times do we expect in reality and (ii) how do inter-arrival 
times evolve over time? 

By analysing a few hundred torrents in a small community,
previous work [7] has shown that user inter-arrival
times are exponentially increasing. Our goal is to generalize
this finding for ‘open’ communities such as Mininova.org
that are orders of magnitude larger. For our analysis, we use
similar techniques as applied in [7]. We consider all torrents
in our macroscopic trace. We use linear regression to fit the
logarithm of the complementary⁴ of the number of node
arrivals of each torrent over time. Let X_t denote the
complementary number of node arrivals at time epoch t and Y_t
be the fitting result. We define the relative deviation of
the actual node arrivals over an ideally exponentially
increasing function by [...]. Thus, a relative deviation of
0% indicates that both curves overlap. Fig. 6 shows the
deviation for each torrent of our macroscopic trace. The
x-axis depicts the torrents ordered by ascending population
size while the y-axis shows the relative deviation. For most
of the torrents, the relative deviation is less than 10%,
whereas the deviation tends to decrease with increasing
torrent popularity. Altogether, the average relative
deviation of all torrents is 4.8%. Therefore, we conclude
that the inter-arrival time of the nodes exponentially
increases with time.


³ Our microscopic measurements show that the download rate of
users that finish downloads (D_n in the example) is higher
than the download rate of those that do not (D_{n+1}),
validating our assumption.

⁴ We use the complementary number of node arrivals to avoid
domains in which the logarithm is undefined, e.g., epochs
with no peer arrivals.


Figure 6. Deviation from linear regression.

Figure 7. Maximum inter-arrival times of torrents with highly
unavailable seeders.


Notably, we observed especially high inter-arrival times
in torrents affected by seedless states; this is in line with
our analysis in the previous section. For instance, Fig. 7
plots the maximum inter-arrival time observed in these
torrents with unavailable seeders. More than 45% of the
torrents exhibit inter-arrival times far beyond 10 hours.
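
The regression step used in this section can be sketched as follows. The sketch fits ln(X_t) against the epoch index by ordinary least squares and exponentiates the fitted line to obtain Y_t. The deviation measure, the mean of |X_t - Y_t| / X_t, is an assumption chosen for illustration, as is the use of a strictly positive input series; all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def fit_log_linear(xs: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of ln(x_t) = intercept + slope * t for t = 0..n-1."""
    n = len(xs)
    ts = list(range(n))
    logs = [math.log(x) for x in xs]  # requires every x_t > 0
    t_mean = sum(ts) / n
    l_mean = sum(logs) / n
    slope = sum((t - t_mean) * (l - l_mean) for t, l in zip(ts, logs)) / sum(
        (t - t_mean) ** 2 for t in ts
    )
    return slope, l_mean - slope * t_mean


def fitted_values(xs: Sequence[float]) -> List[float]:
    """Y_t: the exponential curve implied by the log-linear fit."""
    slope, intercept = fit_log_linear(xs)
    return [math.exp(intercept + slope * t) for t in range(len(xs))]


def mean_relative_deviation(xs: Sequence[float]) -> float:
    """Mean of |X_t - Y_t| / X_t (assumed form of the relative deviation)."""
    ys = fitted_values(xs)
    return sum(abs(x - y) / x for x, y in zip(xs, ys)) / len(xs)


# A hypothetical, roughly exponential series over ten epochs.
series = [1000.0, 760.0, 540.0, 410.0, 300.0, 225.0, 160.0, 125.0, 90.0, 68.0]
print(f"{100 * mean_relative_deviation(series):.1f}%")
```

A perfectly exponential series yields a deviation of zero under this measure, matching the statement that 0% indicates overlapping curves.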


5.2.3 Seeding Times of Users 


The second behavioural characteristic that is paramount to
the creation of seedless states is the seeding time of a node,
i.e. how long seeds stay online for. As already shown in
our example torrent (cf. Fig. 5), to maintain file
availability it is necessary for seeders to remain online for
long enough for new seeds to be generated. Fig. 8 shows the
cumulative distribution of the seeding times of the nodes
obtained from the two microscopic measurements. It can be
seen that [...]. When this data is compared to the
inter-arrival time of users it can be identified that the
current seeding times in BitTorrent aren’t [...] long-term
file availability in BitTorrent.


5.3 How long are Seedless States? 


The representative snapshot presented in Fig. 2 has
highlighted that torrents can become available again after
extended periods of unavailability. In this section we
validate (using both the macroscopic and microscopic
datasets) that file unavailability is, in fact, discontinuous
with reoccurring periods of temporary availability. Through
our measurement studies we can state that this occurs [...]
participated in. This allows the [...] (i.e. in susceptible
torrents) [...]. Alongside the existence of resilient
torrents, this means that 23.5% of all leechers affected by
seedless states can actually still gain access to the file.


Figure 8. Seeding time distribution.


The reoccurrence of seeders happens in over 64% of tor- 
rents that suffer from seedless states in our macroscopic 
study. To investigate this, Fig. 9 shows the CDFs of both
the duration of seedless states as well as the duration of 
the subsequent periods in which content becomes avail- 
able again, computed over all torrents exhibiting this phe- 
nomenon. Note that the x-axis is in log scale. It can be ob- 
served that seedless periods are typically long-lasting with 
an average of 43.19 hours whereas the subsequent availabil- 
ity periods only last 12.56 hours on average. 
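
The durations discussed above can be extracted from a time series of seeder counts. The following is a minimal sketch of one way to do so, not the procedure used in the study: the sample format (hour, number of seeders) and the function name are assumptions, and a period that is still open at the end of the trace is discarded because its length is unknown.

```python
def cycle_durations(samples):
    """Split a trace of (hour, seeder_count) samples, sorted by hour, into
    seedless periods and the availability periods that follow a seedless state.

    Returns (seedless_hours, available_hours). The final, still-open period
    is dropped because its true length cannot be observed.
    """
    seedless, available = [], []
    run_start = samples[0][0]
    run_available = samples[0][1] > 0
    seen_seedless = False
    for hour, seeders in samples[1:]:
        state = seeders > 0
        if state == run_available:
            continue  # still inside the same period
        length = hour - run_start
        if run_available:
            # Only count availability that reoccurs after a seedless state.
            if seen_seedless:
                available.append(length)
        else:
            seedless.append(length)
            seen_seedless = True
        run_start, run_available = hour, state
    return seedless, available

# Hypothetical trace: seeders vanish at hour 5, return at hour 45,
# and vanish again at hour 57.
trace = [(0, 2), (5, 0), (45, 1), (57, 0), (60, 0)]
seedless, available = cycle_durations(trace)
print(seedless, available)
```

Averaging the two returned lists over all torrents that exhibit the phenomenon would give figures comparable to the averages quoted above.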


[...] (e.g. Vuze, uTorrent) [...] file download. Unfortunately,
BitTorrent users do not have permanent identifiers and thus we
cannot make quantitative statements on exactly how many unique
seeders rejoin a [...] period. However, the length of
the seedless periods as depicted in Fig. 9 offers a conservative
bound for the inter-seeding time distribution of such
users. This is obviously very coarse-grained and therefore
we expect the inter-seeding times actually to be higher.


Figure 9. Length of cycles of temporary availability and unavailability periods. (Series: temp. availability, seedless state.)


The reoccurrence of seeders is obviously in contrast to
previous work that has assumed unavailability is continuous
and immutable. To investigate the impact that this finding
has on previous work, we briefly look at the relative deviation
that the assumption has when compared against our
dataset. We define the relative deviation as

\[
\text{relative deviation} = \frac{\text{measured avail. time} - \text{assumed avail. time}}{\text{assumed avail. time}} \tag{4}
\]

We find that in approximately 35% of the torrents in our
macroscopic dataset, the assumption works well. In these
torrents, we did not observe any temporary availability period
after the torrent first enters a seedless state. However,
in 50% of the torrents, the content is actually available for
at least twice as long as that assumed when considering
immutable unavailability.
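
As a worked example of Eq. (4), the sketch below computes the relative deviation for a single torrent. The function and argument names are hypothetical. It assumes that the "assumed" availability time is the time until the torrent first enters a seedless state, which is our reading of the immutable-unavailability assumption rather than something the equation itself states.

```python
def relative_deviation(measured_avail_hours, assumed_avail_hours):
    """Eq. (4): (measured - assumed) / assumed availability time."""
    if assumed_avail_hours <= 0:
        raise ValueError("assumed availability time must be positive")
    return (measured_avail_hours - assumed_avail_hours) / assumed_avail_hours

# Hypothetical torrent: seeders leave after 24 hours (assumed availability),
# but temporary availability periods add another 24 hours (measured: 48 hours).
print(relative_deviation(48.0, 24.0))
```

A value of 0 means the assumption holds exactly, whereas a value of 1 or more means the content was available for at least twice as long as assumed.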


6 Improving File Availability 


The previous sections have outlined the file availability
problem and highlighted the significant impact that seeders
have on this. We therefore deduce that a solution must find
some way to [...] after they have obtained it themselves, i.e. to
[...] from leaving torrents. To do this we highlight two possible
approaches: single-torrent and cross-torrent mechanisms.
The principles of these two approaches are first abstractly
outlined to show how each might improve seeding times.
Following this, the two approaches are evaluated considering
the key findings described in the previous sections (e.g.
reoccurrence of seeders). For this purpose, we run trace-based
simulations using the workload from our large-scale dataset.


6.1 Potential Solution Approaches for Extending Seeding Times


This section briefly outlines the two generic approaches 
that can be taken for improving seeding in BitTorrent. The 
first is using traditional single-torrent principles whilst the 
second exploits the concept of cross-torrent collaboration 
(originally outlined in [7]). Note that we do not offer con- 
crete implementational details; instead, we provide a brief 
outline of the principles behind each mechanism. 


6.1.1 Single-Torrent Solution 


A single-torrent solution involves incentivising users to remain
within a torrent to seed, based on certain properties
related to that individual torrent. As of yet we do not know
of any successful mechanisms to achieve this due to the dif[...].
[...] therefore consider a simple framework of [...] that may
work. Such a solution would involve [...] distributed within the
swarm. The tracker would be responsible for managing this
encryption and, as such, would be the source of the keys.
Subsequently, once a peer has downloaded the file it would be
required to remain seeding for a [...].


6.1.2 Cross-Torrent Solution 


A cross-torrent solution involves incentivising users to
cooperate with the system as opposed to individual torrents.
This approach is motivated by observations from our
macroscopic trace that shows [...] 4.98 on average). We have
further found that [...], therefore providing conclusive evidence
that the same peers rejoin the BitTorrent system multiple
times [...]. To highlight the principles of a cross-torrent
solution, imagine a user who joins torrent X at some point in
time and completes the download; this user may very well join
another torrent Y at a later point in time. When the node comes
online again to download torrent Y it could then theoretically
persist as a replica for torrent X.
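
The principle can be sketched in a few lines. This is only an illustration under simplifying assumptions: the `Peer` record and the `replicas` helper are hypothetical names, a peer is assumed to keep every file it has completed, and any peer that is online for some torrent is assumed to be willing to serve its completed files.

```python
from dataclasses import dataclass, field

@dataclass
class Peer:
    peer_id: str
    completed: set = field(default_factory=set)    # torrents fully downloaded earlier
    downloading: set = field(default_factory=set)  # torrents the peer is online for now

def replicas(peers, torrent):
    """Peers that are currently online for any torrent and hold a full copy of `torrent`."""
    return [p.peer_id for p in peers if p.downloading and torrent in p.completed]

# The user from the example: completed X earlier, now back online to fetch Y.
peers = [
    Peer("user", completed={"X"}, downloading={"Y"}),
    Peer("offline-user", completed={"X"}),  # holds X but is not online
]
print(replicas(peers, "X"))
```

Under a single-torrent view, torrent X would have no seeder here; under the cross-torrent view, the returning user acts as a replica for X while downloading Y.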


The incentives behind this could be managed in a number
of ways (e.g. [17]). The following example highlights how
the system could work using persistent contribution histories.
Through this [...] the system would maintain [...]
to which torrent the contribution is made). Subsequently,
[...]. This would therefore [...].


6.2 Experimental Methodology 


To evaluate the two possible solution approaches, the
[...] al. [2] is used and extended to enable the simulation of
multiple torrents existing in parallel.


6.2.1 Evaluative Aims


We do not aim to perform an implementational comparison 
between vanilla BitTorrent and the proposed approaches, 
e.g., regarding protocol overhead and technical aspects to 
realize either approach. This is out of the scope of this pa- 
per. The goal of our evaluation is to shed light on the fea- 
sibility and potential of the two approaches based on the 
newly discovered observations from our studies. 

For both approaches we wish to discover (i) whether the
approach increases file availability in torrents with ordinarily
unavailable seeders, and (ii) what the implications of
this are for download performance. We aim to investigate
these factors on both a per-torrent and system-wide
basis to explore how the approaches affect
both perspectives on BitTorrent.


6.2.2 Input to the experiments 


Selecting the Torrents: Our trace data encompasses tens
of thousands of torrents over a period of several weeks, far
[...]. Hence, we chose a random subset of [...] affected by
seedless states with file sizes varying between 3 and 1500 MB
and a per-torrent monitoring period of at least four
weeks.⁵ The logs of these torrents contain data on more
than 235,000 downloads.

User behavior: To model the access pattern of torrents, we
do not use any artificial peer arrival function. Instead, we
[...]. To model the number of swarms that a peer joins we
calculate the [...]. Finally, after finishing their downloads,
[...] (cf. Fig. 8).

Speed distributions: To have a representative bandwidth
distribution, we first associate each IP address with a
country, using a freely available geolocation database [12].
Based on the country of origin, the Ookla database [16] [...].


⁵ We have also experimented with larger and smaller numbers of torrents.
Due to space constraints, we opt for presenting only a representative sample.

⁶ We find through simulations that 36 hours is enough time to get a
download success ratio over 99% in the presence of seeders for all access
links and file sizes used in our experiments.

⁷ We have also experimented with other datasets [9] [4] and obtained
similar results.


Protocol                 Avg. seeding time (in hours)   D_S (in KBps)   F_S (in %)
BT: Vanilla              3.44                           137.84          20.25
ST: 2x seeding           6.88                           158.11          13.65
ST: 5x seeding           17.20                          179.41          4.39
ST: 10x seeding          34.40                          190.81          0.66
CT: Persistent history   3.44                           138.75          0.13

Table 2. Overview of system-level results.


Failures in contribution histories: To represent information
inconsistencies in the distribution of contribution histories
(e.g. due to churn), when encountering a new user in
the cross-torrent approach, the contribution history is only
known with a probability of 0.9. This represents a worst-case
scenario, as the literature has reported a superior accuracy
of 0.96 [17].
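
One simple way to model such failures in a simulator is a Bernoulli trial per encounter. The sketch below is an illustration only: the function name, the dictionary of histories and the meaning of the stored value are hypothetical, and only the probability of 0.9 is taken from the text.

```python
import random

KNOWN_PROBABILITY = 0.9  # probability that a peer's history is known (from the text)

def lookup_history(histories, peer_id, rng):
    """Return the peer's contribution history, or None when the lookup fails.

    A failed lookup models inconsistent history information, e.g. due to churn.
    """
    if rng.random() < KNOWN_PROBABILITY:
        return histories.get(peer_id, 0.0)
    return None

# Hypothetical history store: peer identifier -> contribution score.
rng = random.Random(42)  # fixed seed so that the run is repeatable
histories = {"peer-1": 1.5}
trials = 10_000
known = sum(lookup_history(histories, "peer-1", rng) is not None for _ in range(trials))
known_fraction = known / trials
print(known_fraction)
```

Over many encounters the fraction of successful lookups approaches 0.9; a peer whose history is unknown would then have to be treated like a newcomer.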


6.2.3 Performance metrics 


We utilise two performance metrics to evaluate the effectiveness
of the approaches. The first is the average download rate (D)
and the second is the fraction of aborted downloads (F). Both
metrics are calculated on a per-torrent (D_T, F_T) and
system-wide basis (D_S, F_S).
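
To make the two levels of aggregation concrete, the sketch below computes an average download rate D and a fraction of aborted downloads F both per torrent and system-wide. The record layout and function names are hypothetical. Two assumptions are made that the text does not settle: the average rate is taken over completed downloads only, and the system-wide values pool all downloads rather than averaging the per-torrent values.

```python
from collections import defaultdict

def metrics(downloads):
    """downloads: list of (torrent, completed, rate_kbps) tuples.

    Returns (D, F): mean rate of completed downloads in KBps and the
    percentage of downloads that were aborted.
    """
    rates = [rate for _, completed, rate in downloads if completed]
    avg_rate = sum(rates) / len(rates) if rates else 0.0
    aborted_pct = 100.0 * (len(downloads) - len(rates)) / len(downloads)
    return avg_rate, aborted_pct

def per_torrent_metrics(downloads):
    """Group downloads by torrent and compute (D, F) for each torrent."""
    groups = defaultdict(list)
    for record in downloads:
        groups[record[0]].append(record)
    return {torrent: metrics(records) for torrent, records in groups.items()}

# Hypothetical simulation log with two torrents.
log = [
    ("A", True, 100.0),
    ("A", False, 0.0),   # aborted download
    ("B", True, 200.0),
    ("B", True, 100.0),
]
system_d, system_f = metrics(log)
by_torrent = per_torrent_metrics(log)
print(system_d, system_f, by_torrent)
```

The per-torrent view exposes differences that the system-wide numbers hide: in this toy log, torrent A loses half of its downloads while torrent B loses none.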


6.3 Comparative Results 


Table 2 gives an overview of the three variants in which
the measured seeding times are lengthened by a factor of
either 2, 5 or 10. For comparability reasons, we assumed
for the cross-torrent approach [...].
The results presented in this table summarize the fraction of
download abortions (F_S) and the average downloading rate
(D_S) on a system-wide level. Some points worth noting:


• In the chosen set of torrents, 20% of downloads were [...].

• To maintain persistent file availability in the single-torrent
approach (e.g. to ensure a system-wide success rate for
downloads [...]), [...]. This induces an [...].

• The cross-torrent approach achieves a [...]. However, this is
achieved [...]. With regard to download rates, the cross-torrent
approach also performs similarly to vanilla BitTorrent.
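
The single-torrent variants in Table 2 lengthen each measured seeding time by a constant factor, so the average seeding time scales by the same factor. The sketch below illustrates this arithmetic; the three sample seeding times are invented so that their mean matches the 3.44 hours reported for vanilla BitTorrent and are not taken from the trace.

```python
def scale_seeding_times(seeding_hours, factor):
    """Lengthen every measured seeding time by a constant factor."""
    return [hours * factor for hours in seeding_hours]

def mean(values):
    return sum(values) / len(values)

# Invented sample whose mean is 3.44 hours (the vanilla average in Table 2).
measured = [1.0, 2.5, 6.82]

averages = {
    factor: round(mean(scale_seeding_times(measured, factor)), 2)
    for factor in (1, 2, 5, 10)
}
print(averages)
```

The resulting averages for the factors 2, 5 and 10 correspond to the 6.88, 17.20 and 34.40 hours listed for the ST variants in Table 2.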


In addition to this table, Fig. 10 and Fig. 11 separately
plot the fraction of aborted downloads and the average
downloading rate, respectively, on a per-torrent basis. It can
be observed that, in a few torrents, even a 10 times
multiplication [...] is 102.26 KBps for vanilla BitTorrent and
121.65 KBps for the cross-torrent variant. As this is an [...]
performance based on popularity. Whilst not shown in the plots,
the [...]. Therefore, 43.55% of the users that finish in vanilla
BitTorrent gain a performance increase of 88% when applying
cross-torrent collaboration. However, the remaining 66.44% of
these users suffer from an average performance decrease of 22%.

Figure 10. Overview of the per-torrent abortion ratios. (Two panels, x-axis: Torrents; BT: Vanilla vs. ST: 10x seeding, and BT: Vanilla vs. CT: Persistent history.)

Figure 11. Average download rates on a per-torrent basis. (y-axis: Download Rate D (in KBps); series: BT: Vanilla, CT: Persistent history.)


6.4 Summary


To conclude, even when considering the seeding discontinuity
of users, the single-torrent approach [...]: to ensure a
system-wide file availability of >99%, average seeding times of
more than 34 hours are required. In [...] easily. It therefore
allows the 20% of users that could not complete their downloads
in vanilla BitTorrent to effectively download the file. The
download performance of
the cross-torrent approach is on a system level equivalent 
to vanilla BitTorrent, and even improves download times on 
a per-torrent basis. Although this finding is at first glance 
surprising, the cross-torrent approach benefits from the sig- 
nificant increase in nodes (>20%) which now find content 
available. This, in turn, allows users that previously could 
not access the content to download at high rates; this com- 
pensates for the inherent performance disadvantages of peer 
selection policies that are optimised for fairness [6]. 

However, it must also be noted that the increase of file 
availability due to cross-torrent collaboration is achieved by 
a notable trade-off. That is, the download rate of more than 
half of the users that finish downloads with vanilla BitTorrent
accounting degrades by 22%. This can be contrasted
with an 88% improvement for the remainder of peers.


7 Conclusions 


This paper has investigated BitTorrent's unavailability
problem in the wild and explored the feasibility of the potential
solution-space. To achieve this, two large-scale measurement
studies were performed to ascertain the characteristics,
causes and repercussions of file unavailability in
BitTorrent. Based on this, we made a number of interesting
findings that offer the most accurate study of file availability
in BitTorrent so far. Most notably, it was found that (i)
a lack of seeders often, but not always, results in unavailability,
(ii) the churn level, the fast replication of rare chunks
and the population size largely define a swarm's ability to
survive without a seeder, (iii) unavailability usually occurs
in cyclic periods with intermittent availability, and (iv)
unavailability often results in a chain effect that leads to future
download failures.

Due to these new findings, the solution-space was also
investigated to see how they affect both single-torrent and
cross-torrent solutions. It was found that the continuance of
BitTorrent's single-torrent mechanisms can only address the
problem with a 10-fold increase in seeding times. In contrast,
great potential has been found in using the cross-torrent
approach, which maintains current performance levels whilst
also achieving over 99% availability.